Smart Paper Notes

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

Translated by Google Translate

AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris Chinenye Emezue

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-11-07

In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning has received little consideration in the context of NLP, and especially in multilingual language model pretraining. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM is able to generalize well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
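
The active-learning idea in the abstract above, where a model picks the samples it expects to benefit from most, can be illustrated with a generic uncertainty-sampling sketch. This is not AfroLM's self-active learning framework, whose selection criterion the abstract does not describe. The entropy criterion, the function names `entropy` and `select_most_uncertain`, and the toy probabilities are all assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_most_uncertain(pool, predict_proba, k):
    """Return the k pool items whose predicted distribution has the highest entropy.

    `predict_proba` is a hypothetical callable mapping an item to a list of
    class probabilities; a real system would call the current model here.
    """
    ranked = sorted(pool, key=lambda item: entropy(predict_proba(item)), reverse=True)
    return ranked[:k]


# Toy stand-in for a model's predictions (assumed values, not real results).
toy_predictions = {
    "sentence_a": [0.98, 0.02],  # confident prediction, low entropy
    "sentence_b": [0.55, 0.45],  # near-uniform prediction, high entropy
    "sentence_c": [0.80, 0.20],
}

selected = select_most_uncertain(list(toy_predictions), toy_predictions.get, k=2)
print(selected)  # ['sentence_b', 'sentence_c']
```

In a full loop, the selected samples would be added to the training set, the model retrained, and the selection repeated. Entropy is only one of several common criteria, and the one chosen here is an assumption.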

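BibleTTS is a large, high-quality, open speech dataset for ten languages spoken in Sub-Saharan Africa. The corpus contains up to 86 hours of aligned, studio-quality 48kHz single-speaker recordings per language, enabling the development of high-quality text-to-speech models. The ten languages represented are: Akuapem Twi, Asante Twi, Chichewa, Ewe, Hausa, Kikuyu, Lingala, Luganda, Luo, and Yoruba. The corpus is a derivative work of Bible recordings made and released by the Open.Bible project from Biblica. We have aligned, cleaned, and filtered the original recordings, and additionally hand-checked a subset of the alignments for each language. We present results for text-to-speech models with Coqui TTS. The data is released under a commercial-friendly CC-BY-SA license.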

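Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables comparison on an equal footing using leaderboards, but the evaluation choices become suboptimal as better alternatives arise. This problem is especially pertinent in natural language generation, which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online, and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.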
